From 834d85c454d4a84b6ab8f61fa0109c706143e673 Mon Sep 17 00:00:00 2001
From: Spencer
Date: Sun, 5 Oct 2025 19:26:45 +0000
Subject: [PATCH] new idea, work in progress

---
 ..._-_Connecting_Projects_to_Share_Files.mdwn | 72 +++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 doc/tips/Acquaintances_-_Connecting_Projects_to_Share_Files.mdwn

diff --git a/doc/tips/Acquaintances_-_Connecting_Projects_to_Share_Files.mdwn b/doc/tips/Acquaintances_-_Connecting_Projects_to_Share_Files.mdwn
new file mode 100644
index 0000000000..00048cb64e
--- /dev/null
+++ b/doc/tips/Acquaintances_-_Connecting_Projects_to_Share_Files.mdwn
@@ -0,0 +1,72 @@
+# Acquaintances: Sharing Files through Connected Projects
+
+I often connect repos together during my scientific work, where I like to follow the [YODA (Datalad)](https://handbook.datalad.org/en/latest/_images/dataset_modules.svg) standard of connecting related projects via submodules. However, I've recently found that sometimes I have to connect an entire repo to, say, a paper just to use one resource. For the sake of provenance, this connection is essential, but it feels extremely inefficient and unscalable to have one repo filled with submodules just for individual files.
+
+For these specific instances, I'm devising an alternative solution: acquaintance repos.
+
+## Acquaintances are Unrelated Repos
+
+In general, an acquaintance is a repo whose *history* (branches, worktree, commits) is not relevant to the current repo, but which is the origin of some files that the current repo uses. This is unlike *clones* (where everything is related), *parents/children* (where the entire child is derived from or related to the parent, e.g. superproject team repos and their children), or other [groups](https://git-annex.branchable.com/preferred_content/standard_groups/) defined by git-annex (archives, sources, etc.).
+
+This definition requires upholding some technical details:
+
+1. Acquaintances should **never sync**. This precludes defining them as normal git remotes unless you are very diligent about unsetting `remote.<name>.fetch` and setting `remote.<name>.annex-sync=false`.
+1. Acquaintances don't need to know about *all* files in the acquaintance repo (neither in a git sense nor in an annex sense), just the files used. Therefore `git annex filter-branch` is a bit overkill, but it could be done manually by selecting exactly the keys needed.
+
+## Solution - A Special Remote with Custom Groups
+
+(`gx` is short for `git annex`)
+
+Define a special remote that points to the primary storage location for the acquaintance repo.
+I like to define it with a name like `acq.X` so it's obvious by inspection that it's an acquaintance.
+Other metadata also tells you this (`gx group acq.X` will list `acquaintance`, or something could be added to the description),
+but having it in the name makes it clear, especially in e.g. `gx list`.
+
+### Depot: Primary Storage
+
+The depot is where a repo stores its *own* stuff.
+This prevents others' stuff from being duplicated into the referencing repo.
+For those familiar with the `client` group, `depot`s are just clients with acquaintances replacing archives.
+
+`gx groupwanted depot "(include=* and (not (copies=acquaintance:1))) or approxlackingcopies=1"`
+
+### Acquaintance
+
+The acquaintance is the source for stuff the current repo references.
+Therefore, its content doesn't need to be stored by the repo (i.e. in its depot).
+
+`gx groupwanted acquaintance present`
+
+### Finishing Up
+
+The ideal way to actually register where acquaintance files are is `gx fsck`.
+This is better than e.g. the `gx filter-branch` approach mentioned above because it's automatic.
+The default behavior of `fsck`, like other annex commands, is to check against files *in the current worktree*,
+so it only populates the special remote's location metadata for the files the current repo actually tracks.
+
+`gx fsck -f acq.X -J 10`
+
+This may be a bit slow initially because it has to check each file in the worktree by querying the remote, downloading known files, and verifying their hashes before they're registered as present in the new acquaintance.
+
+In short, the process involves:
+
+1. For every external file desired by a repo:
+    1. Copy the file (or a symlink) to the current repo and track it with annex.
+    1. Define a new special remote `acq.X` pointing to the acquaintance repo's depot/storage location for the file.
+    1. Assign the special remote to the group `acquaintance`.
+    1. Assign any storage locations for the current repo to the group `depot`.
+    1. Run `gx fsck -f acq.X` to populate the new special remote's contents relative to the current repo's worktree/branch.
+    1. Run `gx sync` if desired. The result should be files present in the current repo (if desired) and in the acquaintance, but not in the depot(s).
+    1. Now the acquaintance acts as a link back to the origin of the referenced files, without duplication or having to add the entire acquaintance as a submodule!
+
+## FAQ/Open Questions
+
+1. Is there a way to define the custom groups globally, or will I have to re-define the custom groups in every repo that uses acquaintances/depots?
+    1. Not sure yet. I wonder where custom groups could be defined globally? Maybe in the user `.gitconfig`.
+1. Is there a way to get CLI autocomplete to suggest custom groups?
+    1. Not sure yet.
+1. Will this play well with standard groups and the assistant, especially if `client`s and `archive`s are used?
+    1. Probably not. I don't use the assistant, but I suspect that anyone who wanted to would have to define depots as clients with the acquaintance logic added alongside the archive logic rather than substituted for it.
+
+
+
-- 
2.30.2